Geometric Methods for Robust Data Analysis in High Dimension

Author

  • Joseph Anderson
Abstract

Data-driven applications are growing. Machine learning and data analysis now find both scientific and industrial application in biology, chemistry, geology, medicine, and physics. These applications rely on large quantities of data gathered from automated sensors and user input. Furthermore, the dimensionality of many datasets is extreme: more details are being gathered about single user interactions or sensor readings. All of these applications encounter problems with a common theme: use observed data to make inferences about the world. Our work obtains the first provably efficient algorithms for Independent Component Analysis (ICA) in the presence of heavy-tailed data. The main tool in this result is the centroid body (a well-known topic in convex geometry), along with optimization and random walks for sampling from a convex body. This is the first algorithmic use of the centroid body and it is of independent theoretical interest, since it effectively replaces the estimation of covariance from samples, and is more generally accessible. We demonstrate that ICA is itself a powerful geometric primitive. That is, having access to an efficient algorithm for ICA enables us to efficiently solve other important problems in machine learning. The first such reduction is a solution to the open problem of efficiently learning the intersection of n + 1 halfspaces in R^n, posed in [43]. This reduction relies on a non-linear transformation of samples from such an intersection of halfspaces (i.e. a simplex) to samples which are approximately from a

Similar resources

Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data

Background and purpose: As science, knowledge, and technology evolve, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimension problems, classical methods are not reliable because of a large number of predictor variable...

Robust Principal Component Analysis and Fractal Methods to Delineate Mineralization-Related Hydrothermally-Altered Zones from ASTER Data: A Case Study of Dehaj Terrain, Central Iran

The Dehaj area, located in the southern part of the Urumieh-Dokhtar magmatic belt, is a well-endowed terrain hosting a number of world-class porphyry copper deposits. These deposits are all hosted in an acidic to intermediate volcano-plutonic sequence greatly affected by various types of the hydrothermal alterations, whether argillic, phyllic or propylitic. Although there are a handful of hithe...

Multiscale Geometric Methods for Estimating Intrinsic Dimension

We present a novel approach for estimating the intrinsic dimension of certain point clouds: we assume that the points are sampled from a manifold M of dimension k, with k << D, and corrupted by D-dimensional noise. When M is linear, one may analyze this situation by PCA: with no noise one would obtain a rank k matrix, and noise may be treated as a perturbation of the covariance matrix. When M is...

A robust aggregation operator for multi-criteria decision-making method with bipolar fuzzy soft environment

Molodtsov initiated soft set theory, which provides a general mathematical framework for handling uncertainties in which we encounter the data by affixing a parameterized factor during the information analysis, as distinct from fuzzy as well as bipolar fuzzy set theory. The main object of this paper is to lay a foundation for providing a new application of bipolar fuzzy soft tool in ...

Mammalian Eye Gene Expression Using Support Vector Regression to Evaluate a Strategy for Detecting Human Eye Disease

Background and purpose: Machine learning is a class of modern and strong tools that can solve many important problems that nowadays humans may be faced with. Support vector regression (SVR) is a way to build a regression model which is an incredible member of the machine learning family. SVR has been proven to be an effective tool in real-value function estimation. As a supervised-learning appr...

The Use of Robust Factor Analysis of Compositional Geochemical Data for the Recognition of the Target Area in Khusf 1:100000 Sheet, South Khorasan, Iran

The closed nature of geochemical data has been proven in many studies. Compositional data have special properties that mean that standard statistical methods cannot be used to analyse them. These data imply a particular geometry called Aitchison geometry in the simplex space. For analysis, the dataset must first be opened by the various transformations provided. One of the most popular of the a...

Journal title:
  • CoRR

Volume abs/1705.09269

Publication date 2017